The research on life expectancy in countries takes the spotlight in the notebook's machine learning model. Substantial data analysis and predictive algorithms are used to uncover the reasons causing differences in longevity among countries. With the aid of strong statistical tools, valuable insights into the complex link between healthcare, socioeconomic factors, and life expectancy are sought
About the features:-
Country-Country name(target variable)
Year-year of survey
Status-Status of the country's development(target variable)
Population-Population of country
Hepatitis B-Percentage of people finally one year old who were immunized against hepatitis B
Measles-The number of reported measles cases per 1000 people
Polio-Percentage of 1-year-olds immunized against polio
Diphtheria-Percentage of people finally one year old who were immunized against diphtheria
HIV/AIDS-The number of deaths caused by AIDS of the last 4-year-olds who were born alive per 1000 people
infant deaths-The number of infant deaths per 1000 people
under-five deaths-The number of deaths of people under 5 years old per 1000 people
Total expenditure-The ratio of government medical-health expenses to total government expenses in percentage
GDP-Gross domestic product
BMI-The average body mass index of the entire population of the country
thinness 1-19 years-Prevalence of thinness among people 19 years old in percentage
Alcohol-Liters of alcohol consumption among people over 15 years old
Schooling-The number of years that people study
Life expectancy-Country life expectancy(target variable)
Aim:-
The dataset aims to provide insights into the relationship between the non target features and the target features. Researchers and data analysts can use this dataset to explore the impact of Hepatitus B, Measles, Polio, Diphtheria and HIV/AIDS on the Life expectancy and GDP, Schooling and Population on the Status of the country.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
import geopandas as gpd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
%matplotlib inline
df=pd.read_csv('life_expectancy.csv')
df.head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 33736494.0 | 65.0 | 1154 | 6.0 | 65.0 | 0.1 | 62 | 83 | 8.16 | 584.259210 | 19.1 | 17.2 | 0.01 | 10.1 | 65.0 |
| 1 | Afghanistan | 2014 | Developing | 327582.0 | 62.0 | 492 | 58.0 | 62.0 | 0.1 | 64 | 86 | 8.18 | 612.696514 | 18.6 | 17.5 | 0.01 | 10.0 | 59.9 |
| 2 | Afghanistan | 2013 | Developing | 31731688.0 | 64.0 | 430 | 62.0 | 64.0 | 0.1 | 66 | 89 | 8.13 | 631.744976 | 18.1 | 17.7 | 0.01 | 9.9 | 59.9 |
| 3 | Afghanistan | 2012 | Developing | 3696958.0 | 67.0 | 2787 | 67.0 | 67.0 | 0.1 | 69 | 93 | 8.52 | 669.959000 | 17.6 | 17.9 | 0.01 | 9.8 | 59.5 |
| 4 | Afghanistan | 2011 | Developing | 2978599.0 | 68.0 | 3013 | 68.0 | 68.0 | 0.1 | 71 | 97 | 7.87 | 63.537231 | 17.2 | 18.2 | 0.01 | 9.5 | 59.2 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2848 entries, 0 to 2847 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 2848 non-null object 1 Year 2848 non-null int64 2 Status 2848 non-null object 3 Population 2204 non-null float64 4 Hepatitis B 2306 non-null float64 5 Measles 2848 non-null int64 6 Polio 2829 non-null float64 7 Diphtheria 2829 non-null float64 8 HIV/AIDS 2848 non-null float64 9 infant deaths 2848 non-null int64 10 under-five deaths 2848 non-null int64 11 Total expenditure 2627 non-null float64 12 GDP 2406 non-null float64 13 BMI 2816 non-null float64 14 thinness 1-19 years 2816 non-null float64 15 Alcohol 2660 non-null float64 16 Schooling 2688 non-null float64 17 Life expectancy 2848 non-null float64 dtypes: float64(12), int64(4), object(2) memory usage: 400.6+ KB
df.describe()
| Year | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2848.000000 | 2.204000e+03 | 2306.000000 | 2848.000000 | 2829.000000 | 2829.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2627.000000 | 2406.000000 | 2816.000000 | 2816.000000 | 2660.000000 | 2688.000000 | 2848.000000 |
| mean | 2007.500000 | 1.283457e+07 | 81.076756 | 2083.082163 | 82.682220 | 82.451396 | 1.756461 | 28.359902 | 39.500000 | 5.935577 | 7664.398813 | 38.503374 | 4.847230 | 4.638932 | 12.060156 | 69.347402 |
| std | 4.610582 | 6.196094e+07 | 25.019068 | 10249.107207 | 23.434954 | 23.693936 | 5.148935 | 117.188032 | 159.800866 | 2.504439 | 14466.241793 | 19.955485 | 4.443695 | 4.064721 | 3.320160 | 9.528332 |
| min | 2000.000000 | 3.400000e+01 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 0.100000 | 0.000000 | 0.000000 | 0.370000 | 1.681350 | 1.000000 | 0.100000 | 0.010000 | 0.000000 | 36.300000 |
| 25% | 2003.750000 | 1.967585e+05 | 77.000000 | 0.000000 | 78.000000 | 78.000000 | 0.100000 | 0.000000 | 0.000000 | 4.240000 | 477.541713 | 19.500000 | 1.600000 | 0.930000 | 10.200000 | 63.500000 |
| 50% | 2007.500000 | 1.391756e+06 | 92.000000 | 16.000000 | 93.000000 | 93.000000 | 0.100000 | 3.000000 | 4.000000 | 5.760000 | 1841.086830 | 43.900000 | 3.300000 | 3.785000 | 12.400000 | 72.200000 |
| 75% | 2011.250000 | 7.438947e+06 | 97.000000 | 336.750000 | 97.000000 | 97.000000 | 0.700000 | 20.000000 | 25.000000 | 7.530000 | 6265.658907 | 56.200000 | 7.125000 | 7.810000 | 14.300000 | 75.800000 |
| max | 2015.000000 | 1.293859e+09 | 99.000000 | 212183.000000 | 99.000000 | 99.000000 | 50.600000 | 1800.000000 | 2500.000000 | 17.600000 | 119172.741800 | 77.600000 | 27.700000 | 17.870000 | 20.700000 | 89.000000 |
df.isnull().sum()
Country 0 Year 0 Status 0 Population 644 Hepatitis B 542 Measles 0 Polio 19 Diphtheria 19 HIV/AIDS 0 infant deaths 0 under-five deaths 0 Total expenditure 221 GDP 442 BMI 32 thinness 1-19 years 32 Alcohol 188 Schooling 160 Life expectancy 0 dtype: int64
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
df.columns
Index(['Country', 'Year', 'Status', 'Population', 'Hepatitis B', 'Measles',
'Polio', 'Diphtheria', 'HIV/AIDS', 'infant deaths', 'under-five deaths',
'Total expenditure', 'GDP', 'BMI', 'thinness 1-19 years', 'Alcohol',
'Schooling', 'Life expectancy'],
dtype='object')
df[df['Population'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | Antigua and Barbuda | 2015 | Developing | NaN | 99.0 | 0 | 86.0 | 99.0 | 0.2 | 0 | 0 | NaN | 13566.95410 | 47.7 | 3.3 | NaN | 13.9 | 76.4 |
| 65 | Antigua and Barbuda | 2014 | Developing | NaN | 99.0 | 0 | 96.0 | 99.0 | 0.2 | 0 | 0 | 5.54 | 12888.29667 | 47.0 | 3.3 | 8.56 | 13.9 | 76.2 |
| 66 | Antigua and Barbuda | 2013 | Developing | NaN | 99.0 | 0 | 98.0 | 99.0 | 0.2 | 0 | 0 | 5.33 | 12224.86416 | 46.4 | 3.3 | 8.58 | 13.9 | 76.1 |
| 67 | Antigua and Barbuda | 2012 | Developing | NaN | 98.0 | 0 | 97.0 | 98.0 | 0.2 | 0 | 0 | 5.39 | 12565.44197 | 45.7 | 3.3 | 8.18 | 13.8 | 75.9 |
| 68 | Antigua and Barbuda | 2011 | Developing | NaN | 99.0 | 0 | 99.0 | 99.0 | 0.1 | 0 | 0 | 5.65 | 11929.34991 | 45.1 | 3.3 | 7.84 | 14.1 | 75.7 |
df['Population'].fillna(df['Population'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Population column, the column does not require further data cleaning.
df[df['Hepatitis B'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44 | Algeria | 2003 | Developing | 3243514.0 | NaN | 15374 | 87.0 | 87.0 | 0.1 | 20 | 23 | 3.60 | 294.335560 | 47.0 | 6.3 | 0.34 | 11.5 | 71.7 |
| 45 | Algeria | 2002 | Developing | 3199546.0 | NaN | 5862 | 86.0 | 86.0 | 0.1 | 20 | 23 | 3.73 | 1774.336730 | 46.1 | 6.3 | 0.36 | 11.1 | 71.6 |
| 46 | Algeria | 2001 | Developing | 31592153.0 | NaN | 2686 | 89.0 | 89.0 | 0.1 | 20 | 24 | 3.84 | 1732.857979 | 45.3 | 6.4 | 0.23 | 10.9 | 71.4 |
| 47 | Algeria | 2000 | Developing | 3118366.0 | NaN | 0 | 86.0 | 86.0 | 0.1 | 21 | 25 | 3.49 | 1757.177970 | 44.4 | 6.5 | 0.25 | 10.7 | 71.3 |
| 57 | Angola | 2006 | Developing | 2262399.0 | NaN | 765 | 36.0 | 34.0 | 2.5 | 90 | 143 | 4.54 | 262.415149 | 18.2 | 9.8 | 5.84 | 7.2 | 47.7 |
df['Hepatitis B'].fillna(df['Hepatitis B'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Hepatitis B column, the column does not require further data cleaning.
df[df['Polio'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1674 | Montenegro | 2005 | Developing | 614261.0 | 81.076756 | 0 | NaN | NaN | 0.1 | 0 | 0 | 8.46 | 3674.617924 | 55.7 | 2.3 | NaN | 12.8 | 73.6 |
| 1675 | Montenegro | 2004 | Developing | 613353.0 | 81.076756 | 0 | NaN | NaN | 0.1 | 0 | 0 | 8.45 | 338.199535 | 55.0 | 2.3 | 0.01 | 12.6 | 73.5 |
| 1676 | Montenegro | 2003 | Developing | 612267.0 | 81.076756 | 0 | NaN | NaN | 0.1 | 0 | 0 | 8.91 | 2789.173500 | 54.2 | 2.4 | 0.01 | 0.0 | 73.5 |
| 1677 | Montenegro | 2002 | Developing | 69828.0 | 81.076756 | 0 | NaN | NaN | 0.1 | 0 | 0 | 8.33 | 216.243274 | 53.5 | 2.5 | 0.01 | 0.0 | 73.4 |
| 1678 | Montenegro | 2001 | Developing | 67389.0 | 81.076756 | 0 | NaN | NaN | 0.1 | 0 | 0 | 8.23 | 199.583957 | 52.7 | 2.5 | 0.01 | 0.0 | 73.3 |
df['Polio'].fillna(df['Polio'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Polio column, the column does not require further data cleaning.
df[df['Diphtheria'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1674 | Montenegro | 2005 | Developing | 614261.0 | 81.076756 | 0 | 82.68222 | NaN | 0.1 | 0 | 0 | 8.46 | 3674.617924 | 55.7 | 2.3 | NaN | 12.8 | 73.6 |
| 1675 | Montenegro | 2004 | Developing | 613353.0 | 81.076756 | 0 | 82.68222 | NaN | 0.1 | 0 | 0 | 8.45 | 338.199535 | 55.0 | 2.3 | 0.01 | 12.6 | 73.5 |
| 1676 | Montenegro | 2003 | Developing | 612267.0 | 81.076756 | 0 | 82.68222 | NaN | 0.1 | 0 | 0 | 8.91 | 2789.173500 | 54.2 | 2.4 | 0.01 | 0.0 | 73.5 |
| 1677 | Montenegro | 2002 | Developing | 69828.0 | 81.076756 | 0 | 82.68222 | NaN | 0.1 | 0 | 0 | 8.33 | 216.243274 | 53.5 | 2.5 | 0.01 | 0.0 | 73.4 |
| 1678 | Montenegro | 2001 | Developing | 67389.0 | 81.076756 | 0 | 82.68222 | NaN | 0.1 | 0 | 0 | 8.23 | 199.583957 | 52.7 | 2.5 | 0.01 | 0.0 | 73.3 |
df['Diphtheria'].fillna(df['Diphtheria'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Diptheria column, the column does not require further data cleaning.
df[df['Total expenditure'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | Algeria | 2015 | Developing | 3.987153e+07 | 95.0 | 63 | 95.0 | 95.0 | 0.1 | 21 | 24 | NaN | 4132.762920 | 59.5 | 6.0 | NaN | 14.4 | 75.6 |
| 48 | Angola | 2015 | Developing | 2.785935e+06 | 64.0 | 118 | 7.0 | 64.0 | 1.9 | 66 | 98 | NaN | 3695.793748 | 23.3 | 8.3 | NaN | 11.4 | 52.4 |
| 64 | Antigua and Barbuda | 2015 | Developing | 1.283457e+07 | 99.0 | 0 | 86.0 | 99.0 | 0.2 | 0 | 0 | NaN | 13566.954100 | 47.7 | 3.3 | NaN | 13.9 | 76.4 |
| 80 | Argentina | 2015 | Developing | 4.341776e+07 | 94.0 | 0 | 93.0 | 94.0 | 0.1 | 8 | 9 | NaN | 13467.123600 | 62.8 | 1.0 | NaN | 17.3 | 76.3 |
| 96 | Armenia | 2015 | Developing | 2.916950e+05 | 94.0 | 33 | 96.0 | 94.0 | 0.1 | 1 | 1 | NaN | 369.654776 | 54.9 | 2.1 | NaN | 12.7 | 74.8 |
df['Total expenditure'].fillna(df['Total expenditure'].mean() ,inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Total expenditure column, the column does not require further data cleaning.
df[df['GDP'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 160 | Bahamas | 2015 | Developing | 1.283457e+07 | 95.0 | 0 | 95.0 | 95.0 | 0.1 | 0 | 0 | 5.935577 | NaN | 64.5 | 2.5 | NaN | 12.6 | 76.1 |
| 161 | Bahamas | 2014 | Developing | 1.283457e+07 | 96.0 | 0 | 96.0 | 96.0 | 0.1 | 0 | 0 | 7.740000 | NaN | 63.8 | 2.5 | 9.45 | 12.6 | 75.4 |
| 162 | Bahamas | 2013 | Developing | 1.283457e+07 | 97.0 | 0 | 97.0 | 97.0 | 0.1 | 0 | 0 | 7.500000 | NaN | 63.2 | 2.5 | 9.42 | 12.6 | 74.8 |
| 163 | Bahamas | 2012 | Developing | 1.283457e+07 | 96.0 | 0 | 99.0 | 98.0 | 0.2 | 0 | 0 | 7.430000 | NaN | 62.6 | 2.5 | 9.50 | 12.6 | 74.9 |
| 164 | Bahamas | 2011 | Developing | 1.283457e+07 | 95.0 | 0 | 97.0 | 98.0 | 0.1 | 0 | 0 | 7.630000 | NaN | 62.0 | 2.5 | 9.34 | 12.6 | 75.0 |
df['GDP'].fillna(df['GDP'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in GDP column, the column does not require further data cleaning.
df[df['BMI'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2320 | South Sudan | 2015 | Developing | 11882136.0 | 31.000000 | 878 | 41.0 | 31.0 | 3.4 | 26 | 39 | 5.935577 | 758.725782 | NaN | NaN | NaN | 4.9 | 57.3 |
| 2321 | South Sudan | 2014 | Developing | 1153971.0 | 81.076756 | 441 | 44.0 | 39.0 | 3.5 | 26 | 39 | 2.740000 | 1151.861715 | NaN | NaN | NaN | 4.9 | 56.6 |
| 2322 | South Sudan | 2013 | Developing | 1117749.0 | 81.076756 | 525 | 5.0 | 45.0 | 3.6 | 26 | 40 | 2.620000 | 1186.113250 | NaN | NaN | NaN | 4.9 | 56.4 |
| 2323 | South Sudan | 2012 | Developing | 1818258.0 | 81.076756 | 1952 | 64.0 | 59.0 | 3.8 | 26 | 40 | 2.770000 | 958.455810 | NaN | NaN | NaN | 4.9 | 56.0 |
| 2324 | South Sudan | 2011 | Developing | 1448857.0 | 81.076756 | 1256 | 66.0 | 61.0 | 3.9 | 27 | 41 | 5.935577 | 176.971300 | NaN | NaN | NaN | 4.9 | 55.4 |
df['BMI'].fillna(df['BMI'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in BMI column, the column does not require further data cleaning.
df[df['thinness 1-19 years'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2320 | South Sudan | 2015 | Developing | 11882136.0 | 31.000000 | 878 | 41.0 | 31.0 | 3.4 | 26 | 39 | 5.935577 | 758.725782 | 38.503374 | NaN | NaN | 4.9 | 57.3 |
| 2321 | South Sudan | 2014 | Developing | 1153971.0 | 81.076756 | 441 | 44.0 | 39.0 | 3.5 | 26 | 39 | 2.740000 | 1151.861715 | 38.503374 | NaN | NaN | 4.9 | 56.6 |
| 2322 | South Sudan | 2013 | Developing | 1117749.0 | 81.076756 | 525 | 5.0 | 45.0 | 3.6 | 26 | 40 | 2.620000 | 1186.113250 | 38.503374 | NaN | NaN | 4.9 | 56.4 |
| 2323 | South Sudan | 2012 | Developing | 1818258.0 | 81.076756 | 1952 | 64.0 | 59.0 | 3.8 | 26 | 40 | 2.770000 | 958.455810 | 38.503374 | NaN | NaN | 4.9 | 56.0 |
| 2324 | South Sudan | 2011 | Developing | 1448857.0 | 81.076756 | 1256 | 66.0 | 61.0 | 3.9 | 27 | 41 | 5.935577 | 176.971300 | 38.503374 | NaN | NaN | 4.9 | 55.4 |
df['thinness 1-19 years'].fillna(df['thinness 1-19 years'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in thinness 1-19 years column, the column does not require further data cleaning.
df[df['Alcohol'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | Algeria | 2015 | Developing | 3.987153e+07 | 95.0 | 63 | 95.0 | 95.0 | 0.1 | 21 | 24 | 5.935577 | 4132.762920 | 59.5 | 6.0 | NaN | 14.4 | 75.6 |
| 48 | Angola | 2015 | Developing | 2.785935e+06 | 64.0 | 118 | 7.0 | 64.0 | 1.9 | 66 | 98 | 5.935577 | 3695.793748 | 23.3 | 8.3 | NaN | 11.4 | 52.4 |
| 64 | Antigua and Barbuda | 2015 | Developing | 1.283457e+07 | 99.0 | 0 | 86.0 | 99.0 | 0.2 | 0 | 0 | 5.935577 | 13566.954100 | 47.7 | 3.3 | NaN | 13.9 | 76.4 |
| 80 | Argentina | 2015 | Developing | 4.341776e+07 | 94.0 | 0 | 93.0 | 94.0 | 0.1 | 8 | 9 | 5.935577 | 13467.123600 | 62.8 | 1.0 | NaN | 17.3 | 76.3 |
| 96 | Armenia | 2015 | Developing | 2.916950e+05 | 94.0 | 33 | 96.0 | 94.0 | 0.1 | 1 | 1 | 5.935577 | 369.654776 | 54.9 | 2.1 | NaN | 12.7 | 74.8 |
df['Alcohol'].fillna(df['Alcohol'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Alcohol column, the column does not require further data cleaning.
df[df['Schooling'].isnull()].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 416 | Côte d'Ivoire | 2015 | Developing | 1.283457e+07 | 83.0 | 65 | 81.0 | 83.0 | 1.9 | 57 | 79 | 5.935577 | 7664.398813 | 28.0 | 5.5 | 4.638932 | NaN | 53.3 |
| 417 | Côte d'Ivoire | 2014 | Developing | 1.283457e+07 | 76.0 | 50 | 76.0 | 76.0 | 2.0 | 58 | 80 | 5.720000 | 7664.398813 | 27.4 | 5.6 | 0.010000 | NaN | 52.8 |
| 418 | Côte d'Ivoire | 2013 | Developing | 1.283457e+07 | 8.0 | 48 | 79.0 | 8.0 | 2.4 | 59 | 81 | 5.810000 | 7664.398813 | 26.8 | 5.8 | 3.150000 | NaN | 52.3 |
| 419 | Côte d'Ivoire | 2012 | Developing | 1.283457e+07 | 82.0 | 137 | 83.0 | 82.0 | 2.9 | 59 | 82 | 6.140000 | 7664.398813 | 26.2 | 5.9 | 3.240000 | NaN | 52.0 |
| 420 | Côte d'Ivoire | 2011 | Developing | 1.283457e+07 | 62.0 | 628 | 58.0 | 62.0 | 3.3 | 60 | 83 | 6.420000 | 7664.398813 | 25.6 | 6.1 | 3.130000 | NaN | 51.7 |
df['Schooling'].fillna(df['Schooling'].mean(),inplace=True)
We are filling it with the mean of the column since it contains numerical data.
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
As there are no more null values in Schooling column, the column does not require further data cleaning.
le=LabelEncoder()
le.fit(df['Country'])
LabelEncoder()
df['Country_labeled']=le.transform(df['Country'])
df.head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 33736494.0 | 65.0 | 1154 | 6.0 | 65.0 | 0.1 | 62 | 83 | 8.16 | 584.259210 | 19.1 | 17.2 | 0.01 | 10.1 | 65.0 | 0 |
| 1 | Afghanistan | 2014 | Developing | 327582.0 | 62.0 | 492 | 58.0 | 62.0 | 0.1 | 64 | 86 | 8.18 | 612.696514 | 18.6 | 17.5 | 0.01 | 10.0 | 59.9 | 0 |
| 2 | Afghanistan | 2013 | Developing | 31731688.0 | 64.0 | 430 | 62.0 | 64.0 | 0.1 | 66 | 89 | 8.13 | 631.744976 | 18.1 | 17.7 | 0.01 | 9.9 | 59.9 | 0 |
| 3 | Afghanistan | 2012 | Developing | 3696958.0 | 67.0 | 2787 | 67.0 | 67.0 | 0.1 | 69 | 93 | 8.52 | 669.959000 | 17.6 | 17.9 | 0.01 | 9.8 | 59.5 | 0 |
| 4 | Afghanistan | 2011 | Developing | 2978599.0 | 68.0 | 3013 | 68.0 | 68.0 | 0.1 | 71 | 97 | 7.87 | 63.537231 | 17.2 | 18.2 | 0.01 | 9.5 | 59.2 | 0 |
As we can see a new column named Country_labeled is added in the last
df=df.drop('infant deaths',axis=1)
Since under 5 deaths already covers the infant deaths, it is better to use that feature only.
def func(elem):
if elem=='Developing':
return 0
else:
return 1
df['Status']=df['Status'].apply(lambda x: func(x))
df[df['Status']==1].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112 | Australia | 2015 | 1 | 23789338.0 | 93.0 | 74 | 93.0 | 93.0 | 0.1 | 1 | 5.935577 | 56554.38760 | 66.6 | 0.6 | 4.638932 | 20.4 | 82.8 | 7 |
| 113 | Australia | 2014 | 1 | 2346694.0 | 91.0 | 340 | 92.0 | 92.0 | 0.1 | 1 | 9.420000 | 62214.69120 | 66.1 | 0.6 | 9.710000 | 20.4 | 82.7 | 7 |
| 114 | Australia | 2013 | 1 | 23117353.0 | 91.0 | 158 | 91.0 | 91.0 | 0.1 | 1 | 9.360000 | 67792.33860 | 65.5 | 0.6 | 9.870000 | 20.3 | 82.5 | 7 |
| 115 | Australia | 2012 | 1 | 22728254.0 | 91.0 | 199 | 92.0 | 92.0 | 0.1 | 1 | 9.360000 | 67677.63477 | 65.0 | 0.6 | 10.030000 | 20.1 | 82.3 | 7 |
| 116 | Australia | 2011 | 1 | 223424.0 | 92.0 | 190 | 92.0 | 92.0 | 0.1 | 1 | 9.200000 | 62245.12900 | 64.4 | 0.6 | 10.300000 | 19.8 | 82.0 | 7 |
df[df['Status']==0].head(5)
| Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | 0 | 33736494.0 | 65.0 | 1154 | 6.0 | 65.0 | 0.1 | 83 | 8.16 | 584.259210 | 19.1 | 17.2 | 0.01 | 10.1 | 65.0 | 0 |
| 1 | Afghanistan | 2014 | 0 | 327582.0 | 62.0 | 492 | 58.0 | 62.0 | 0.1 | 86 | 8.18 | 612.696514 | 18.6 | 17.5 | 0.01 | 10.0 | 59.9 | 0 |
| 2 | Afghanistan | 2013 | 0 | 31731688.0 | 64.0 | 430 | 62.0 | 64.0 | 0.1 | 89 | 8.13 | 631.744976 | 18.1 | 17.7 | 0.01 | 9.9 | 59.9 | 0 |
| 3 | Afghanistan | 2012 | 0 | 3696958.0 | 67.0 | 2787 | 67.0 | 67.0 | 0.1 | 93 | 8.52 | 669.959000 | 17.6 | 17.9 | 0.01 | 9.8 | 59.5 | 0 |
| 4 | Afghanistan | 2011 | 0 | 2978599.0 | 68.0 | 3013 | 68.0 | 68.0 | 0.1 | 97 | 7.87 | 63.537231 | 17.2 | 18.2 | 0.01 | 9.5 | 59.2 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2848 entries, 0 to 2847 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 2848 non-null object 1 Year 2848 non-null int64 2 Status 2848 non-null int64 3 Population 2848 non-null float64 4 Hepatitis B 2848 non-null float64 5 Measles 2848 non-null int64 6 Polio 2848 non-null float64 7 Diphtheria 2848 non-null float64 8 HIV/AIDS 2848 non-null float64 9 under-five deaths 2848 non-null int64 10 Total expenditure 2848 non-null float64 11 GDP 2848 non-null float64 12 BMI 2848 non-null float64 13 thinness 1-19 years 2848 non-null float64 14 Alcohol 2848 non-null float64 15 Schooling 2848 non-null float64 16 Life expectancy 2848 non-null float64 17 Country_labeled 2848 non-null int32 dtypes: float64(12), int32(1), int64(4), object(1) memory usage: 389.5+ KB
df.describe()
| Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2848.000000 | 2848.000000 | 2.848000e+03 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 | 2848.000000 |
| mean | 2007.500000 | 0.174157 | 1.283457e+07 | 81.076756 | 2083.082163 | 82.682220 | 82.451396 | 1.756461 | 39.500000 | 5.935577 | 7664.398813 | 38.503374 | 4.847230 | 4.638932 | 12.060156 | 69.347402 | 88.500000 |
| std | 4.610582 | 0.379311 | 5.450439e+07 | 22.511937 | 10249.107207 | 23.356624 | 23.614741 | 5.148935 | 159.800866 | 2.405271 | 13295.954693 | 19.843020 | 4.418652 | 3.928223 | 3.225515 | 9.528332 | 51.392386 |
| min | 2000.000000 | 0.000000 | 3.400000e+01 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 0.100000 | 0.000000 | 0.370000 | 1.681350 | 1.000000 | 0.100000 | 0.010000 | 0.000000 | 36.300000 | 0.000000 |
| 25% | 2003.750000 | 0.000000 | 4.283075e+05 | 81.076756 | 0.000000 | 78.000000 | 79.000000 | 0.100000 | 0.000000 | 4.340000 | 615.809250 | 19.600000 | 1.600000 | 1.137500 | 10.400000 | 63.500000 | 44.000000 |
| 50% | 2007.500000 | 0.000000 | 3.742413e+06 | 87.000000 | 16.000000 | 93.000000 | 93.000000 | 0.100000 | 4.000000 | 5.935577 | 3217.098778 | 43.500000 | 3.400000 | 4.175000 | 12.200000 | 72.200000 | 88.500000 |
| 75% | 2011.250000 | 0.000000 | 1.283457e+07 | 96.000000 | 336.750000 | 97.000000 | 97.000000 | 0.700000 | 25.000000 | 7.370000 | 7664.398813 | 56.100000 | 7.100000 | 7.467500 | 14.125000 | 75.800000 | 133.000000 |
| max | 2015.000000 | 1.000000 | 1.293859e+09 | 99.000000 | 212183.000000 | 99.000000 | 99.000000 | 50.600000 | 2500.000000 | 17.600000 | 119172.741800 | 77.600000 | 27.700000 | 17.870000 | 20.700000 | 89.000000 | 177.000000 |
sc = StandardScaler()
dsc=sc.fit_transform(df.drop('Country',axis=1))
cols=['Year', 'Status', 'Population', 'Hepatitis B', 'Measles',
'Polio', 'Diphtheria', 'HIV/AIDS', 'under-five deaths',
'Total expenditure', 'GDP', 'BMI', 'thinness 1-19 years', 'Alcohol',
'Schooling', 'Life expectancy','Country_labeled']
dsc=pd.DataFrame(dsc,columns=cols)
dsc.head(5)
| Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.626978 | -0.459221 | 0.383558 | -0.714269 | -0.090666 | -3.283680 | -0.739134 | -0.321766 | 0.272262 | 0.924974 | -0.532597 | -0.978016 | 2.796088 | -1.178585 | -0.607810 | -0.456341 | -1.722347 |
| 1 | 1.410048 | -0.459221 | -0.229508 | -0.847555 | -0.155268 | -1.056940 | -0.866196 | -0.321766 | 0.291038 | 0.933291 | -0.530458 | -1.003218 | 2.863994 | -1.178585 | -0.638818 | -0.991680 | -1.722347 |
| 2 | 1.193118 | -0.459221 | 0.346769 | -0.758698 | -0.161319 | -0.885652 | -0.781488 | -0.321766 | 0.309815 | 0.912500 | -0.529025 | -1.028420 | 2.909265 | -1.178585 | -0.669826 | -0.991680 | -1.722347 |
| 3 | 0.976187 | -0.459221 | -0.167679 | -0.625412 | 0.068693 | -0.671543 | -0.654426 | -0.321766 | 0.334850 | 1.074672 | -0.526150 | -1.053622 | 2.954535 | -1.178585 | -0.700835 | -1.033668 | -1.722347 |
| 4 | 0.759257 | -0.459221 | -0.180861 | -0.580983 | 0.090748 | -0.628721 | -0.612073 | -0.321766 | 0.359886 | 0.804385 | -0.571768 | -1.073784 | 3.022441 | -1.178585 | -0.793859 | -1.065158 | -1.722347 |
dcorr=dsc.corr()
dcorr
| Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1.000000e+00 | 6.081278e-18 | 0.014941 | 0.086101 | -0.083160 | 0.094297 | 0.128227 | -0.137193 | -0.036478 | 0.078453 | 0.094081 | 0.106048 | -0.039098 | -0.044714 | 0.202957 | 0.169144 | -3.430464e-18 |
| Status | 6.081278e-18 | 1.000000e+00 | -0.039269 | 0.091944 | -0.070272 | 0.216535 | 0.213482 | -0.147762 | -0.108160 | 0.296801 | 0.448637 | 0.308236 | -0.370209 | 0.576962 | 0.488124 | 0.483330 | 2.695564e-02 |
| Population | 1.494095e-02 | -3.926859e-02 | 1.000000 | -0.111072 | 0.276094 | -0.029801 | -0.021611 | -0.028679 | 0.545203 | -0.065690 | -0.025420 | -0.061766 | 0.236788 | -0.028398 | -0.026595 | -0.017603 | -1.373790e-02 |
| Hepatitis B | 8.610088e-02 | 9.194366e-02 | -0.111072 | 1.000000 | -0.097171 | 0.405061 | 0.501157 | -0.104605 | -0.183739 | 0.059638 | 0.060731 | 0.139434 | -0.111690 | 0.066520 | 0.158324 | 0.204348 | -1.363795e-02 |
| Measles | -8.316023e-02 | -7.027200e-02 | 0.276094 | -0.097171 | 1.000000 | -0.169092 | -0.174605 | 0.045853 | 0.512826 | -0.102828 | -0.062485 | -0.174733 | 0.258073 | -0.050377 | -0.129975 | -0.198178 | 1.101742e-02 |
| Polio | 9.429706e-02 | 2.165350e-01 | -0.029801 | 0.405061 | -0.169092 | 1.000000 | 0.672972 | -0.159411 | -0.188233 | 0.135440 | 0.192384 | 0.285454 | -0.224705 | 0.205739 | 0.377328 | 0.461290 | 2.104899e-02 |
| Diphtheria | 1.282273e-01 | 2.134815e-01 | -0.021611 | 0.501157 | -0.174605 | 0.672972 | 1.000000 | -0.162605 | -0.191159 | 0.149035 | 0.181221 | 0.282239 | -0.223666 | 0.208102 | 0.378301 | 0.473636 | -5.377488e-03 |
| HIV/AIDS | -1.371929e-01 | -1.477618e-01 | -0.028679 | -0.104605 | 0.045853 | -0.159411 | -0.162605 | 1.000000 | 0.039157 | -0.000318 | -0.135291 | -0.244326 | 0.200424 | -0.048878 | -0.218795 | -0.559105 | 9.449946e-02 |
| under-five deaths | -3.647839e-02 | -1.081597e-01 | 0.545203 | -0.183739 | 0.512826 | -0.188233 | -0.191159 | 0.039157 | 1.000000 | -0.122666 | -0.104517 | -0.224889 | 0.470517 | -0.104747 | -0.197999 | -0.224903 | -7.683976e-03 |
| Total expenditure | 7.845277e-02 | 2.968014e-01 | -0.065690 | 0.059638 | -0.102828 | 0.135440 | 0.149035 | -0.000318 | -0.122666 | 1.000000 | 0.123522 | 0.230964 | -0.270532 | 0.302061 | 0.235277 | 0.214010 | 5.117823e-02 |
| GDP | 9.408126e-02 | 4.486375e-01 | -0.025420 | 0.060731 | -0.062485 | 0.192384 | 0.181221 | -0.135291 | -0.104517 | 0.123522 | 1.000000 | 0.275412 | -0.269273 | 0.317181 | 0.428341 | 0.428644 | -2.168154e-02 |
| BMI | 1.060478e-01 | 3.082362e-01 | -0.061766 | 0.139434 | -0.174733 | 0.285454 | 0.282239 | -0.244326 | -0.224889 | 0.230964 | 0.275412 | 1.000000 | -0.530313 | 0.322564 | 0.512410 | 0.562165 | -1.358444e-03 |
| thinness 1-19 years | -3.909783e-02 | -3.702091e-01 | 0.236788 | -0.111690 | 0.258073 | -0.224705 | -0.223666 | 0.200424 | 0.470517 | -0.270532 | -0.269273 | -0.530313 | 1.000000 | -0.424998 | -0.451633 | -0.472400 | 2.125201e-02 |
| Alcohol | -4.471385e-02 | 5.769617e-01 | -0.028398 | 0.066520 | -0.050377 | 0.205739 | 0.208102 | -0.048878 | -0.104747 | 0.302061 | 0.317181 | 0.322564 | -0.424998 | 1.000000 | 0.494941 | 0.389071 | -6.584705e-02 |
| Schooling | 2.029575e-01 | 4.881243e-01 | -0.026595 | 0.158324 | -0.129975 | 0.377328 | 0.378301 | -0.218795 | -0.197999 | 0.235277 | 0.428341 | 0.512410 | -0.451633 | 0.494941 | 1.000000 | 0.713257 | -3.831932e-02 |
| Life expectancy | 1.691442e-01 | 4.833298e-01 | -0.017603 | 0.204348 | -0.198178 | 0.461290 | 0.473636 | -0.559105 | -0.224903 | 0.214010 | 0.428644 | 0.562165 | -0.472400 | 0.389071 | 0.713257 | 1.000000 | -2.701514e-02 |
| Country_labeled | -3.430464e-18 | 2.695564e-02 | -0.013738 | -0.013638 | 0.011017 | 0.021049 | -0.005377 | 0.094499 | -0.007684 | 0.051178 | -0.021682 | -0.001358 | 0.021252 | -0.065847 | -0.038319 | -0.027015 | 1.000000e+00 |
plt.figure(figsize=(20,20))
sns.heatmap(dcorr,annot=True,cmap='magma',linewidths=0.3,annot_kws={'fontsize':12})
plt.title('Correlation mapping between columns',fontsize=20)
plt.show()
plot = sns.clustermap(dcorr,annot=True,cmap='magma',linewidths=0.3,figsize=(20,20),annot_kws={'fontsize':15})
plot.fig.subplots_adjust(top=.95)
plot.fig.suptitle('Clustering with Correlation mapping',size=20)
Text(0.5, 0.98, 'Clustering with Correlation mapping')
Analysis: From the cluster map we can see Country_labeled is highly correlated with HIV/AIDs, Measles is highly correlated with under five deaths, Alcohol is correlated with Status, Schooling is highly correlated with Life expectancy and Diphtheria is highly correlated with Polio.
plt.figure(figsize=(15,10))
dcorr['Country_labeled'][:-1].sort_values().plot(kind='bar',color='purple')
plt.title('Country Correlation with other features',size=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Features',fontdict={'size':20})
plt.ylabel('Correlation',fontdict={'size':20})
plt.show()
Analysis: As mentioned earlier, Country is highly correlated with HIV/AIDS. After that Country is correlated with Total expenditure and Status. Here we can see there is almost 0 correlation between Country and Year.
d_year=df.groupby('Year').mean()
d_year
| Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | ||||||||||||||||
| 2000 | 0.174157 | 1.005478e+07 | 78.193082 | 4242.825843 | 76.511498 | 74.434574 | 2.536517 | 49.241573 | 5.615856 | 5293.117865 | 34.760712 | 5.113452 | 4.596061 | 10.682593 | 66.879213 | 88.5 |
| 2001 | 0.174157 | 1.051882e+07 | 77.407871 | 4234.921348 | 78.185655 | 76.940192 | 2.580337 | 47.859551 | 5.614844 | 5408.110000 | 36.387117 | 4.923564 | 4.587803 | 11.100009 | 67.301124 | 88.5 |
| 2002 | 0.174157 | 8.092535e+06 | 78.928199 | 2843.601124 | 79.889688 | 79.117431 | 2.582022 | 46.426966 | 5.692373 | 5176.790678 | 37.365206 | 5.138733 | 4.692073 | 11.276413 | 67.503371 | 88.5 |
| 2003 | 0.174157 | 1.058643e+07 | 75.959763 | 3384.915730 | 80.518901 | 79.038780 | 2.556180 | 45.011236 | 5.837622 | 5322.803459 | 36.671386 | 4.896036 | 4.718477 | 11.430908 | 67.548876 | 88.5 |
| 2004 | 0.174157 | 1.629304e+07 | 79.007738 | 2479.758427 | 80.339126 | 80.212937 | 2.488764 | 43.797753 | 5.851105 | 7304.755775 | 36.993858 | 4.911205 | 4.779769 | 11.653941 | 67.793820 | 88.5 |
| 2005 | 0.174157 | 1.747027e+07 | 80.891497 | 2583.713483 | 81.103171 | 81.791589 | 2.357303 | 42.269663 | 5.910375 | 7462.576899 | 35.797229 | 5.079182 | 4.866673 | 11.767986 | 68.308989 | 88.5 |
| 2006 | 0.174157 | 1.512488e+07 | 81.636963 | 1549.016854 | 84.621810 | 82.305907 | 2.180899 | 41.505618 | 5.676555 | 7538.734335 | 37.740487 | 5.098283 | 4.941679 | 11.991582 | 68.751124 | 88.5 |
| 2007 | 0.174157 | 1.768548e+07 | 81.617091 | 980.438202 | 84.846529 | 84.642985 | 2.005056 | 40.179775 | 5.671836 | 8614.558275 | 37.160712 | 5.017946 | 5.027466 | 12.093829 | 69.139326 | 88.5 |
| 2008 | 0.174157 | 9.935077e+06 | 83.508624 | 806.870787 | 85.588102 | 84.980064 | 1.809551 | 38.960674 | 5.730319 | 10335.106356 | 38.414083 | 4.890980 | 5.019938 | 12.216301 | 69.564045 | 88.5 |
| 2009 | 0.174157 | 9.684164e+06 | 84.052274 | 1262.331461 | 85.279114 | 85.086806 | 1.550000 | 37.561798 | 6.542397 | 7251.491790 | 39.778690 | 4.770194 | 4.908477 | 12.336526 | 70.030337 | 88.5 |
| 2010 | 0.174157 | 1.235833e+07 | 80.759277 | 1690.544944 | 83.846529 | 82.732873 | 1.282022 | 36.488764 | 6.144532 | 7627.397039 | 38.676442 | 4.643227 | 4.958084 | 12.462930 | 70.172472 | 88.5 |
| 2011 | 0.174157 | 1.191125e+07 | 83.853920 | 1943.842697 | 84.780899 | 87.151685 | 1.100562 | 34.988764 | 5.964420 | 9608.847664 | 40.017454 | 4.630306 | 4.882859 | 12.610121 | 70.780337 | 88.5 |
| 2012 | 0.174157 | 1.241679e+07 | 82.926954 | 1219.887640 | 84.196629 | 85.910112 | 0.932584 | 33.752809 | 6.256467 | 10109.260742 | 40.406218 | 4.623003 | 4.157185 | 12.726975 | 71.039888 | 88.5 |
| 2013 | 0.174157 | 1.237397e+07 | 83.313732 | 1397.168539 | 85.398876 | 86.331461 | 0.785393 | 32.500000 | 6.297815 | 8461.828390 | 41.512397 | 4.669632 | 4.087354 | 12.823042 | 71.358427 | 88.5 |
| 2014 | 0.174157 | 1.915122e+07 | 82.920042 | 1502.297753 | 84.719101 | 83.994382 | 0.688764 | 31.314607 | 6.214276 | 9778.320193 | 41.470263 | 4.573003 | 3.338196 | 12.876413 | 71.657303 | 88.5 |
| 2015 | 0.174157 | 1.169611e+07 | 82.251072 | 1207.179775 | 83.089888 | 84.550562 | 0.667416 | 30.140449 | 5.948435 | 7336.681547 | 42.901723 | 4.576935 | 4.660822 | 12.912930 | 71.729775 | 88.5 |
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Schooling',data=d_year,palette="magma",ci=None)
plt.title('Year vs Schooling',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Schooling',fontdict={'size':20})
plt.show()
Analysis: We can see that as the year passed by Schooling has increased.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Population',data=d_year,palette="magma",ci=None)
plt.title('Year vs Population',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Population',fontdict={'size':20})
plt.show()
Analysis: Here there is no specific trend but we can see that maximum population is on 2014 and minimum population is on 2002 based on the given data.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Hepatitis B',data=d_year,palette="magma",ci=None)
plt.title('Year vs Hepatitis B',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Hepatitis B',fontdict={'size':20})
plt.show()
Analysis: Here we can see that trend in Hepatitis B in respect to Year has been similar. It has ranged between 75 and 85 only in all the years.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Measles',data=d_year,palette="magma",ci=None)
plt.title('Year vs Measles',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Measles',fontdict={'size':20})
plt.show()
Analysis: Here we can see that earlier Measles was very high and then starting going down till 2008 and then starts slowly rising till 2011. After 2011 there has been an irregular trend.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Polio',data=d_year,palette="magma",ci=None)
plt.title('Year vs Polio',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Polio',fontdict={'size':20})
plt.show()
Analysis: Here we can see Polio has a very similar trend with respect to year. It ranges between 75 and 90.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Diphtheria',data=d_year,palette="magma",ci=None)
plt.title('Year vs Diphtheria',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Diphtheria',fontdict={'size':20})
plt.show()
Analysis: Here we can see Diphtheria has a very similar trend with respect to year. It ranges between 70 and 90.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='HIV/AIDS',data=d_year,palette="magma",ci=None)
plt.title('Year vs HIV/AIDS',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('HIV/AIDS',fontdict={'size':20})
plt.show()
Analysis: Here we can HIV/AIDS has gone down in subsequent years and was at it was maximum in 2001 and minimum in 2015.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='under-five deaths',data=d_year,palette="magma",ci=None)
plt.title('Year vs under-five deaths',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('under-five deaths',fontdict={'size':20})
plt.show()
Analysis: Here we can see that under-five deaths has been strictly decreasing from 2000 to 2015.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Total expenditure',data=d_year,palette="magma",ci=None)
plt.title('Year vs Total expenditure',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Total expenditure',fontdict={'size':20})
plt.show()
Analysis: Here we can see Total expenditure follow similar trend between 2001 and 2008 and then it rises and follows similar trend again between 2010 and 2015.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='GDP',data=d_year,palette="magma",ci=None)
plt.title('Year vs GDP',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('GDP',fontdict={'size':20})
plt.show()
Analysis: Here we can see GDP has a very irregular pattern with respect to Year. It was maximum in 2008 and minimum in 2002.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='BMI',data=d_year,palette="magma",ci=None)
plt.title('Year vs BMI',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('BMI',fontdict={'size':20})
plt.show()
Analysis: Here we can see BMI has slightly increased with respect to Years. It ranged between 30 and 50.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='thinness 1-19 years',data=d_year,palette="magma",ci=None)
plt.title('Year vs thinness 1-19 years',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('thinness 1-19 years',fontdict={'size':20})
plt.show()
Analysis: Here we can see thinness has slightly decreased with respect to Years. It ranges between 4.5 and 6.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Alcohol',data=d_year,palette="magma",ci=None)
plt.title('Year vs Alcohol',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Alcohol',fontdict={'size':20})
plt.show()
Analysis: Here we can Alcohol had a similar trend between Year 2000 and 2011 and then it started decreasing till 2014 and spiked up in 2015.
plt.figure(figsize=(20,10))
X=d_year.index
ax=sns.barplot(x=X,y='Life expectancy',data=d_year,palette="magma",ci=None)
plt.title('Year vs Life expectancy',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Year',fontdict={'size':20})
plt.ylabel('Life expectancy',fontdict={'size':20})
plt.show()
Analysis: Here we can see Life Expectancy has been strictly increasing with respect to Years.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(5)
C:\Users\Lenovo\AppData\Local\Temp\ipykernel_3500\4109974910.py:1: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
| pop_est | continent | name | iso_a3 | gdp_md_est | geometry | |
|---|---|---|---|---|---|---|
| 0 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | 58005463.0 | Africa | Tanzania | TZA | 63177 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| 2 | 603253.0 | Africa | W. Sahara | ESH | 907 | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| 3 | 37589262.0 | North America | Canada | CAN | 1736425 | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| 4 | 328239523.0 | North America | United States of America | USA | 21433226 | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
merged_data = world.merge(df, left_on='name', right_on='Country')
merged_data.head(5)
| pop_est | continent | name | iso_a3 | gdp_md_est | geometry | Country | Year | Status | Population | ... | HIV/AIDS | under-five deaths | Total expenditure | GDP | BMI | thinness 1-19 years | Alcohol | Schooling | Life expectancy | Country_labeled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji | 2015 | 0 | 892149.0 | ... | 0.1 | 0 | 5.935577 | 4921.896290 | 62.7 | 4.0 | 4.638932 | 15.3 | 69.9 | 53 |
| 1 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji | 2014 | 0 | 88586.0 | ... | 0.1 | 0 | 4.490000 | 546.372800 | 61.9 | 4.0 | 0.010000 | 15.2 | 69.7 | 53 |
| 2 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji | 2013 | 0 | 879715.0 | ... | 0.1 | 0 | 4.260000 | 4763.689560 | 61.1 | 4.0 | 0.010000 | 15.0 | 69.6 | 53 |
| 3 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji | 2012 | 0 | 873596.0 | ... | 0.1 | 0 | 4.160000 | 4546.738540 | 6.2 | 4.0 | 0.010000 | 14.9 | 69.4 | 53 |
| 4 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji | 2011 | 0 | 86786.0 | ... | 0.1 | 0 | 4.600000 | 4353.121392 | 59.5 | 4.0 | 0.010000 | 14.7 | 69.2 | 53 |
5 rows × 24 columns
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Schooling', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Schooling distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see that maximum schooling is in Australia and minimum schooling is in Turkmenistan in Asia and Niger in Africa based on our data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Population', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=50000000)
ax.set_title('Population distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see Brazil, Ukraine, Philippines, Pakistan and Bangladesh have a population above 50 million people as per our given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Hepatitis B', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Hepatitis B distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see USA, New Zealand, Ukraine, Belarus, Phillipines, Yemen and French Guiana have very low Hepatitis B and Mexico, Cuba, Egypt, Lithuania, Poland, Slovakia, Oman and Kazakhstan have very high Hepatitis B as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Measles', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=4000)
ax.set_title('Measles distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: We can see Australia, North America and almost entire South America is Measles free and India, Nepal, Thailand, Vietnam, Philippines, Malaysia, Japan, Turkey, Eastern costal Africa and France have maximum Measles based on our data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Polio', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Polio distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see USA, Australia, Namibia, Angola, Chad, Kenya, Suriname, Costa Rica and Guatemala are almost Polio free and Mexico, Brazil, almost entire Europe, Botswana, Egypt, Japan, Thailand, Malaysia, Oman, Uzbekistan, Iceland and Turkmenistan have very high Polio as per our given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Diphtheria', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Diphtheria distribution in the world ',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see Australia, New Zealand, Uruguay, Mozambique, Iraq, Germany, Norway and Geogia are Diphtheria free and Mexico, Brazil, Peru, Oman, Egypt, Thailand, Malaysia, Botswana and most of northern Europe have maximum Diphtheria as per our given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='HIV/AIDS', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('HIV/AIDS distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see majority countries are HIV/AIDS free and maximum HIV/AIDS is in Southern Africa-Zimbabwe as per our data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='under-five deaths', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Under-five deaths distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Majority countries in the world have minimum under five deaths and India has the maximum under five deaths as per our data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Total expenditure', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=14)
ax.set_title('Total expenditure distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see Indonesia, Bangladesh, Myanmar and Germany habe minimum Total expenditure and USA and Leone have the maximum Total expenditure as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='GDP', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=10000)
ax.set_title('GDP distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see maximum GDP is in Canada, Japan, Norway, Sweden, Finland, Iceland, Ireland, New Zealand, France, Spain, Germany and Netherlands and minimum GDP is in southern Northern America, most of Africa, southern Asia, Chile, Portugal, Italy, Greece and Ukraine as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='BMI', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=60)
ax.set_title('BMI distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see maximum BMI is in Australia, Canada and New Zealand and minimum BMI is in USA, South Africa, Egypt, Portugal, Liberia, Cameroon, Austia, Slovakia and Switzerland as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='thinness 1-19 years', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Thinness distrubution of 1 to 19 years in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see the maximum Thin people between 1 to 19 years is in India and minimum is in entire North America, South America, Australia, New Zealand, entire Europe, Japan, Mozabique, Angola, Mauritania, Afghanistan, Uzbekistan, Turkmenistan, Kazakistan, Azerbaijan and Mongolia as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Alcohol', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=0, vmax=10)
ax.set_title('Alcohol distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see maximum Alcohol is in Australia and most of Europe and minimum Alcohol is in northern Africa, Arabic countries, southern Asian countries, Estonia, Madagascar, Mozambique and Somalia as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Life expectancy', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5},vmin=40, vmax=80)
ax.set_title('Life expectancy distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see maximum Life Expectancy is in Canada, Australia, Japan, Sweden and Spain and minimum Life Expectancy is in Leone, Angola, Zambia, Zimbabwe, Malawi, Uganda and Eritrea as per the given data.
fig, ax = plt.subplots(figsize=(30, 25))
merged_data.plot(column='Status', cmap='magma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True,legend_kwds={'shrink': 0.5})
ax.set_title('Developed vs Developing distribution in the world',fontdict={'fontsize':30})
ax.set_axis_off()
plt.show()
Analysis: Here we can see USA, Australia, New Zealand, Japan, Ireland, Iceland, Portugal, Spain, Germany, Switzerland, Belgium, Netherlands, Austria, Italy, Slovakia, Poland, Bulgaria, Hungary, Slovenia, Croatia and Belarus are the developed countries and rest are the developing countries.
plot=sns.jointplot(x='BMI',y='Life expectancy',data=df,color='orange',kind="hist",height=10)
plot.fig.subplots_adjust(top=.95)
plot.fig.suptitle('BMI vs Life Expectancy',size=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('BMI',fontdict={'size':20})
plt.ylabel('Life Expectancy',fontdict={'size':20})
plt.show()
Analysis: Here we can see in majority Life Expectancy is between 70 and 80 where BMI is between 50 and 60.
plot=sns.jointplot(x='Polio',y='Diphtheria',data=df,color='black',kind="hist",height=10)
plot.fig.subplots_adjust(top=.95)
plot.fig.suptitle('Polio vs Diphtheria',size=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Polio',fontdict={'size':20})
plt.ylabel('Diphtheria',fontdict={'size':20})
plt.show()
Analysis: Here we can see majority Diptheria is 100 where Polio is also 100.
plt.figure(figsize=(20,10))
ax=sns.countplot(x='Status',data=df,palette="magma")
plt.title('Status vs Frequency',fontdict={'size':20})
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Status',fontdict={'size':20})
plt.ylabel('Frequency',fontdict={'size':20})
plt.show()
Analysis: Here we can see that majority of the data contains developing data that developed data and we also had seen in the world plot that majority countries were developing.
Here we would be using Neural Networks in order to perform Regression to predict Life Expectancy.
X=df.drop(['Country','Life expectancy'],axis=1).values
Y=df['Life expectancy'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.05, random_state=124)
scaler=MinMaxScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
X_train.shape
(2705, 16)
It has 16 columns which would be beneficial to create the neural network.
model=Sequential()
model.add(Dense(64,activation='relu'))
model.add(Dense(32,activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam',loss='mse')
model.fit(x=X_train,y=y_train,validation_data=(X_test,y_test),batch_size=128,epochs=500,verbose=0)
<keras.callbacks.History at 0x1dda10b55e0>
losses=pd.DataFrame(model.history.history)
plt.figure(figsize=(10,10))
plt.plot(losses)
plt.legend(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Epochs vs Loss')
plt.show()
Analysis: Here we can see our normal Loss Line and Validation Line are retracing each other and hence our model is perfectly fit.
predictions = model.predict(X_test)
5/5 [==============================] - 0s 2ms/step
mae=metrics.mean_absolute_error(y_test,predictions)
print("Mean Absolute error:",mae)
Mean Absolute error: 2.3263136643629814
mse=metrics.mean_squared_error(y_test,predictions)
print("Mean Squared error:",mse)
Mean Squared error: 9.737329606455482
rmse=mse**0.5
print("Root Mean Squared error:",rmse)
Root Mean Squared error: 3.120469452895747
Here our Root Mean Squared error is aprroximately 3 which proves our model is perfectly fit for the model.
Here we are taking 2012 India data as an example.
df[df['Country']=='India'].iloc[3]
Country India Year 2012 Status 0 Population 126365852.0 Hepatitis B 73.0 Measles 18668 Polio 79.0 Diphtheria 82.0 HIV/AIDS 0.2 under-five deaths 1400 Total expenditure 4.39 GDP 1446.98541 BMI 17.0 thinness 1-19 years 26.9 Alcohol 3.1 Schooling 11.3 Life expectancy 67.3 Country_labeled 71 Name: 1139, dtype: object
X_India=np.array([[2012,0,126365852.0,73.0,18668,79.0,82.0,0.2,1400,4.39,1466.981541,17.0,26.9,3.1,11.3,71]])
X_India.shape
(1, 16)
X_India=scaler.fit_transform(X_India)
Y_India=model.predict(X_India)
1/1 [==============================] - 0s 23ms/step
print("Predicted: ",Y_India)
Predicted: [[60.52883]]
Actual years is 67 years but model is approximately giving me 61 years and hence my model is implemented successfully.
Here we will be creating Neural Networks to implement Binary Classification to predict whether is Developed or Developing.
X=df.drop(['Country','Status'],axis=1).values
Y=df['Status'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=124)
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
X_train.shape
(1993, 16)
model=Sequential()
model.add(Dense(16,activation='relu'))
model.add(Dense(8,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy')
model.fit(x=X_train,y=y_train,validation_data=(X_test,y_test),batch_size=128,epochs=43,verbose=0)
<keras.callbacks.History at 0x1dda217aaf0>
losses=pd.DataFrame(model.history.history)
plt.figure(figsize=(10,10))
plt.legend(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Epochs vs Loss')
plt.plot(losses)
plt.show()
Analysis: Here we can see Validation Line has been below our Loss line hence our model is perfectly fit.
predictions = (model.predict(X_test)>0.5).astype("int32")
27/27 [==============================] - 0s 1ms/step
print(confusion_matrix(y_test,predictions))
[[671 33] [ 30 121]]
print(classification_report(y_test,predictions))
precision recall f1-score support
0 0.96 0.95 0.96 704
1 0.79 0.80 0.79 151
accuracy 0.93 855
macro avg 0.87 0.88 0.87 855
weighted avg 0.93 0.93 0.93 855
Since our model is able to achieve 93% accuracy using this neural network, our model has successfully predicted the Status of the country.
Here we will be creating Neural Network to implement Multiclass Classification to predict the country.
df_count=df['Country_labeled']
one_hot=pd.get_dummies(df_count,columns=['Country'])
one_hot.head(1)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 rows × 178 columns
X=df.drop(['Country','Year','Country_labeled'],axis=1)
Y=one_hot
X=pd.get_dummies(X,columns=['under-five deaths'])
We are also one hot encoding under five deaths because it will help the model in further predicting the Country.
X.shape
(2848, 241)
Y.shape
(2848, 178)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.025, random_state=124)
Test size is 0.025 because here we have 178 unique features and 2848 rows in Y and so maximum test data can be 178/2848 = 0.0625 or 6.25%. So just to achieve better accuracy 0.025 or 2.5 % of the data is taken as test size.
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
model=Sequential()
model.add(Dense(241,activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(175,activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(178,activation='softmax'))
model.compile(optimizer='rmsprop',loss='categorical_crossentropy')
model.fit(x=X_train,y=y_train,validation_data=(X_test,y_test),batch_size=1024,epochs=1750,verbose=0)
<keras.callbacks.History at 0x1ddaf774700>
losses=pd.DataFrame(model.history.history)
plt.figure(figsize=(10,10))
plt.legend(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Epochs vs Loss')
plt.plot(losses)
plt.show()
Analysis: Here we can see Validation Line has been below our Loss line hence our model is perfectly fit.
predictions = model.predict(X_test)
3/3 [==============================] - 0s 2ms/step
predictions.shape
(72, 178)
Here the predictions contain predicted probabilities for each Country for each row of testing data.
res=np.argmax(predictions,axis=1)
res
array([128, 135, 25, 7, 110, 76, 144, 39, 7, 145, 151, 126, 128,
75, 38, 148, 152, 142, 29, 75, 160, 52, 152, 73, 90, 139,
17, 5, 85, 15, 37, 69, 166, 119, 17, 94, 53, 142, 40,
45, 172, 14, 27, 175, 2, 162, 84, 54, 158, 163, 150, 13,
126, 15, 96, 38, 32, 98, 123, 30, 126, 35, 84, 60, 87,
65, 154, 116, 15, 165, 21, 77], dtype=int64)
Now we have got our predicted country for each row of test data.
y_test.shape
(72, 178)
Since we had one hot encoded it earlier we need the labels back for each row of it.
columns_with_true = y_test.idxmax(axis=1)
real=columns_with_true.values
real
array([128, 135, 25, 7, 110, 36, 144, 39, 7, 145, 151, 126, 128,
75, 38, 148, 152, 142, 29, 75, 160, 52, 152, 162, 90, 139,
17, 5, 85, 15, 37, 69, 166, 85, 17, 94, 53, 142, 40,
45, 172, 14, 27, 175, 2, 162, 84, 54, 158, 163, 21, 13,
126, 15, 96, 38, 32, 98, 123, 30, 126, 89, 84, 60, 87,
65, 9, 116, 15, 165, 21, 77], dtype=int64)
Now we have got the actual Country for each row of test data.
plt.figure(figsize=(20,20))
cmat=confusion_matrix(real,res)
plt.title('Actual vs Predicted Country',fontsize=20)
sns.heatmap(cmat,cmap='magma')
plt.show()
Here we can see we are getting 6 countries wrong which is very less in respect to 72 rows we had taken for testing. Here the colour tells that how much data was for that country in the test data.
report=classification_report(real,res)
f1_score = report.split('\n')[66].split()[1]
print(f"F1 Accuracy: {f1_score}")
F1 Accuracy: 0.92
C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result)) C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result)) C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result)) C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result)) C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result)) C:\Users\Lenovo\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
Here we can see our model is able to achieve 92% accuracy from this neural network, hence our model has successfully predicted the Country.